Interactive Learning Environments
https://doi.org/10.1080/10494820.2022.2096642
The effectiveness of automated writing evaluation in EFL/ESL writing: a three-level meta-analysis
Thuy Thi-Nhu Ngo, Howard Hao-Jan Chen and Kyle Kuo-Wei Lai
English Department, National Taiwan Normal University, Taipei City, Taiwan
ABSTRACT
The present study performs a three-level meta-analysis to investigate the overall effectiveness of automated writing evaluation (AWE) on EFL/ESL student writing performance. 24 primary studies representing 85 between-group effect sizes and 34 studies representing 178 within-group effect sizes, published from 1993 to 2021, were separately meta-analyzed. The results indicated a medium overall between-group effect size (g = 0.59) and a large overall within-group effect size (g = 0.98) of AWE on student writing performance. Analyses of moderators show that: (1) AWE is more effective in improving vocabulary usage but less effective in improving grammar in students’ writing; (2) Grammarly shows potential as a highly effective tool, though Pigai did not demonstrate such effectiveness; (3) medium to long duration of AWE usage leads to a higher effect, but short duration leads to a lower effect on writing outcomes compared to non-AWE treatment; (4) studying with peers in the AWE condition potentially produces a large effect; (5) AWE is beneficial to students at the undergraduate level, students in the EFL context, and students with intermediate English proficiency. Directions for future research are also discussed. Overall, AWE is a beneficial application and is recommended for integration into the writing classroom.
Received 24 February 2022
Accepted 23 June 2022
KEYWORDS
Automated writing evaluation; AWE; writing; meta-analysis; effectiveness
Writing is an important skill for students to succeed in school and in daily life (Graham, 2019). To be a competent writer, it is necessary to master both basic processes (e.g. handwriting/typing, spelling) and complex processes (e.g. idea generation and organization, idea transformation into language, writing revision) in writing. This task requires effective instructional practices in the writing classroom so that students can develop their writing ability. One beneficial practice is the use of technology to assist and scaffold students with the aforementioned processes. With the development of recent technologies, it is now possible for students to receive technology-based feedback. Automated writing evaluation (AWE) is one of the most widely used software applications that provide feedback on writing (Nunes et al., 2021).
AWE is known as a computer tool that can automatically evaluate a written text by providing an overall score and/or feedback on categories such as grammar, mechanics, content, organization, vocabulary usage, or style (Warschauer & Ware, 2006). Originally, it was developed for the purpose of giving summative scores for written texts. In recent decades, the tool has matured and can now provide detailed automated feedback (Nunes et al., 2021). As a result, the application of AWE has gained popularity in school and university settings, as students are provided with more opportunities to plan, write, and revise written texts with the help of AWE feedback (Cotos, 2014; Grimes & Warschauer, 2010). In EFL/ESL writing classrooms, where a large class size is often the norm, the tool can also help reduce teachers’ workload through its capacity to provide individualized feedback on multiple writing drafts (Chen et al., 2017; Warschauer & Ware, 2006).
CONTACT Howard Hao-Jan Chen hjchen@ntnu.edu.tw
Supplemental data for this article can be accessed online at https://doi.org/10.1080/10494820.2022.2096642.
The purpose of the present study is thus to explore the overall effectiveness of automated writing evaluation (AWE) on EFL/ESL students’ writing outcomes. The following sections provide brief background literature on the arguments for and against the use of AWE in writing classrooms and the need for conducting a meta-analysis on the present topic.
The effectiveness of AWE on student writing performance has sparked much debate among researchers and educators (Hockly, 2019). On the one hand, researchers who promote the use of AWE consider the tool to hold three main benefits. First, the tool has the ability to evaluate student writing as appropriately as teachers do, but in a much more time- and cost-effective way (Cotos, 2011). Second, students’ learning motivation and autonomy are facilitated through a learning environment that can support scaffolding practices throughout the multiple-drafting process with AWE (Chen & Cheng, 2008; Cotos, 2011). Moreover, students can receive sufficient support from the tool when teachers are not available (Wang, 2013). Third, the integration of AWE into writing instruction is believed to provide a more consistent and objective assessment across the curriculum (Cotos, 2011), whereas human evaluation is flexible and varies with students’ individual differences (Parra & Calero, 2019).
On the other hand, several issues with AWE usage have been raised by some researchers as well. One of these issues is the vagueness of the feedback, as AWE offers no concrete suggestions for students to improve their ability to present consistent, unified, and relevant messages in their writing (Lai, 2010). Furthermore, AWE feedback is predetermined by the computer programming, which limits its ability to provide rich negotiation of meaning and contributes less to the content development of the writing (Chen & Cheng, 2008). Another argument against AWE usage is that it discriminates against students who are less experienced with technology use (Khoii & Doroudian, 2013). These arguments against AWE usage should also be taken into account in writing instruction. However, a more important consideration is the overall effectiveness of AWE, which is examined in the present meta-analysis.
Although there is evidence of its usefulness, the overall effectiveness of AWE is still underexplored. The most recent attempt to synthesize studies on the effectiveness of AWE is the systematic review by Nunes et al. (2021). Their review offered much insight into the effectiveness and application of AWE. However, further investigation is still necessary to examine some issues that were not explored in their study. First of all, Nunes et al. (2021) solely reviewed the effectiveness of AWE in school settings (i.e. Grades 1–12), which were scarcely studied in the previous literature; as a result, their review could only obtain eight studies, so the conclusiveness of the findings might be questioned. If various education levels had been included in the review (e.g. the university level), the study might have provided a more complete picture of the effectiveness of AWE. Second, the authors included studies on L1 together with studies on L2 in their review. This combination might induce some problems when synthesizing the data, as writing in the L1 differs from writing in the L2 at the lexical level (e.g. word formation, word choice), sentence level (e.g. sentence patterns, sentence subject), and passage level (e.g. the choice of writing topic, voice, organization) (Wang, 2012). Third, the synthesis of different writing-related measures (e.g. grammar, vocabulary, content, organization) was not discussed. The authors found that the collected studies showed a positive impact of AWE on at least one writing-related measure, but the overall impact of AWE could be more meaningful if each measure were stated. Finally, since this is a systematic review, quantitative findings such as the values of the effect sizes were not explored. As Nunes et al. (2021) aimed at investigating the effectiveness of AWE (i.e. the authors only searched for past studies that had experimental designs), the present study conducts a meta-analysis that includes the values of effect sizes to better explore the effectiveness of AWE.
Another reason for the necessity of conducting the present meta-analysis is the variation of the effect of AWE across different studies. When the overall writing scores were compared between the experimental group (i.e. the group with the use of AWE) and the control group (i.e. the group with traditional teaching), the calculated Hedges’ g (i.e. a measure of effect size) was small and negative (g = −0.29) in Mørch et al. (2017), moderate and positive (g = 0.52) in Rich (2012), but large and positive (g > 3) in Hassanzadeh and Fotoohnejad (2021). Hedges’ g values ranging from negative to positive were also observed in sub-categories of writing such as grammar, content, or vocabulary among the primary studies (Gao & Ma, 2019; Huang & Renandya, 2020; Liu et al., 2017; Wang et al., 2013). An apparent explanation for the inconsistent findings was the heterogeneity among the studies, meaning that potentially different factors (i.e. study features such as outcome measures, AWE tools, duration, etc.) had caused such variation.
To tackle the aforementioned issues, a meta-analysis that can synthesize research findings and provide substantial evidence of the effect sizes of AWE on writing outcomes is needed to understand the overall effectiveness of AWE, as well as to discern potential explanations for the variation observed in the effect. Therefore, the present study makes the first attempt to investigate the effectiveness of AWE on EFL/ESL student writing performance following the two research questions below:
RQ1. How effective is AWE?
RQ2. How can the observed variation be accounted for?
Inclusion and exclusion criteria
The present study investigated the effectiveness of AWE in EFL/ESL student writing in both between-group and within-group comparisons. The former means the comparison of writing outcomes between the experimental groups (i.e. the groups that had assistance from the AWE tools to revise their writing) and the control groups (i.e. the groups that experienced a traditional teaching method, such as receiving feedback from teachers and/or peers or self-revising the writing drafts without assistance from the AWE tools). The latter compares student writing outcomes in their pre- and post-tests, or their original and final drafts, after experiencing the treatments from the AWE tools. In certain cases, a study attempting to meta-analyze two types of comparison could offer a more comprehensive picture of an investigated research topic (Lee et al., 2019). Therefore, the present meta-analysis attempted to explore the findings from both types of comparison. The inclusion and exclusion criteria for the collection of the primary studies were established as follows:
The primary studies included in the meta-analysis should be quantitative studies that contained experimental groups and/or control groups.
The effect of AWE on student writing performance was measured in the primary studies, and parameters such as means, standard deviations, sample sizes, or other statistical values offering enough information to compute Hedges’ g values must be provided.
The writing performance of students should be assessed based on their writing products. Studies that reported self-perceived performance were excluded.
The studies in which students in experimental groups also received independent feedback from teachers/peers without referring to the AWE tools were excluded unless there were control groups that could help eliminate the independent effect from teachers/peers. The studies that had students use the tools with teachers/peers could be included since the tools were the main source of providing feedback; however, these studies were coded differently as either learning with teachers or with peers.
The AWE tools used in the studies were to analyze English writing texts. Tools that analyzed the writing texts of other languages (e.g. Dutch, Chinese) besides English were excluded.
Finally, only studies written in English were included.
In the first phase, the primary studies were collected through databases such as ProQuest, Wiley Open Library, Taylor & Francis Online, ERIC, and Google Scholar. Some popular SSCI journals related to language learning and technology were further separately accessed to reduce the likelihood of missing relevant studies, including Computer Assisted Language Learning, Innovation in Language Learning and Teaching, Journal of Computer Assisted Learning, British Journal of Educational Technology, Australasian Journal of Educational Technology, The Journal of the European Association for Computer Assisted Language Learning, CALICO Journal, and Language Learning and Technology Journal.
There were three main sets of keywords for paper screening: (set 1) AWE-related keywords (e.g. automated writing evaluation, AWE, automated, automated feedback, grammar checker, grammar check, English grammar checker, AI and grammar checker, grammar checker and feedback), (set 2) writing-performance-related keywords (e.g. writing outcome, writing performance, essay writing, essay, English writing, writing accuracy, writing skills, writing), and (set 3) design-related keywords (e.g. pretest, posttest, pre-test, post-test, experiment, experimental, control, quantitative, original, revised).
The search was conducted by using the keywords of the first set alone and/or in combination with the keywords in set 2 and set 3. For keywords that produced several hundred to thousands of hits, we read through the first 100 hits or until no more potentially relevant papers were found. In total, approximately 11,000 potential hits were quickly examined by their titles and/or abstracts (including duplicates), and nearly 450 potentially relevant papers (excluding duplicates) were identified for further examination. After applying all the inclusion and exclusion criteria, 38 primary studies qualified for the collection in the first phase.
The second phase of the paper collection was reference chasing of the 38 previously collected studies. This phase added six more papers to the final list, constituting a total of 44 papers. Of these 44 papers, 24 were classified in the list of between-group comparison studies and 34 in the list of within-group comparison studies. The sum of the papers in the two lists was greater than 44 because some papers provided information on only one type of comparison, while others provided both.
Effect size calculation
The present meta-analysis separately meta-analyzed the two datasets, namely the between-group comparison and the within-group comparison. The between-group comparison represented the difference between the experimental treatment and traditional teaching, while the within-group comparison showed the difference between the treatment and no teaching. Theoretically speaking, a higher effect size would be expected in the within-group comparison because even subpar teaching in the traditional method would normally result in some improvement (Boulton & Cobb, 2017). It is, therefore, essential to separate the two comparisons. Each effect size was calculated using Hedges’ g for its consideration of effect-size weighting in the small-sample-size cases included in the present meta-analysis. The equations for calculation are shown below:
Hedges' $g = J \times \dfrac{\mathrm{Mean}_T - \mathrm{Mean}_C}{\sqrt{\dfrac{(n_T - 1)SD_T^2 + (n_C - 1)SD_C^2}{n_T + n_C - 2}}}$  (1)

$SE_g = J \times \sqrt{\dfrac{n_T + n_C}{n_T \, n_C} + \dfrac{d^2}{2\,(n_T + n_C)}}$  (2)

in which

$J = 1 - \dfrac{3}{4\,(n_T + n_C - 2) - 1}$ (the small-sample correction factor) and Cohen's $d = \dfrac{\mathrm{Mean}_T - \mathrm{Mean}_C}{\sqrt{\dfrac{(n_T - 1)SD_T^2 + (n_C - 1)SD_C^2}{n_T + n_C - 2}}}$.
MeanT, nT, and SDT respectively represent the mean, sample size, and standard deviation of the treated group; MeanC, nC, and SDC respectively represent the mean, sample size, and standard deviation of the comparison group (i.e. either the control group in the between-group comparison or the pre-treated group in the within-group comparison); see also Boulton and Cobb (2017), Hedges and Olkin (1985), Lee et al. (2019), and Lipsey and Wilson (2001) for references.
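To make Equations (1) and (2) concrete, the following is a minimal R sketch of the calculation; the function name and the summary statistics in the example are hypothetical and serve only as an illustration.

```r
# Minimal sketch of Equations (1) and (2): Hedges' g and its standard error
# computed from group means, standard deviations, and sample sizes.
# The function name and the example values below are hypothetical.
hedges_g <- function(mean_t, sd_t, n_t, mean_c, sd_c, n_c) {
  sd_pooled <- sqrt(((n_t - 1) * sd_t^2 + (n_c - 1) * sd_c^2) / (n_t + n_c - 2))
  d  <- (mean_t - mean_c) / sd_pooled              # Cohen's d
  j  <- 1 - 3 / (4 * (n_t + n_c - 2) - 1)          # small-sample correction J
  g  <- j * d                                      # Equation (1)
  se <- j * sqrt((n_t + n_c) / (n_t * n_c) +
                 d^2 / (2 * (n_t + n_c)))          # Equation (2)
  c(g = g, se = se)
}

# Example: post-test scores of a hypothetical AWE group vs. control group
hedges_g(mean_t = 78.4, sd_t = 8.2, n_t = 30, mean_c = 73.1, sd_c = 9.0, n_c = 28)
```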
A further consideration lies in the choice between post-test effect sizes and learning-gain effect sizes. Calculating learning-gain effect sizes is still a controversial issue because primary studies often do not report the standard deviations of the gains. Cuijpers et al. (2017) and Harrer et al. (2021) suggested that it is best practice to avoid calculating learning gains for meta-analysis.
Lee et al. (2019) generated learning-gain effect sizes, but only about half of their collected studies provided enough information to do so; the same was true for the present meta-analysis. In order to generate learning-gain effect sizes, those authors used the post-test standard deviation of the control group as a substitute for the missing learning-gain standard deviations. However, in real data, learning-gain standard deviations are likely to differ between groups. In our data, for example, Choi (2011) and Liu et al. (2017) reported different learning-gain standard deviations for the examined groups. Moreover, even when learning-gain effect sizes were generated, Lee et al. (2019) reported that the difference between the gain effect sizes and the post-test effect sizes was relatively small.
After all of the aforementioned considerations, we decided not to generate learning-gain effect sizes, as doing so would not significantly affect the overall effect sizes. Using the post-test or learning-gain parameters as originally reported in the primary studies simplifies data management for any replication and avoids the unresolved controversy of calculating learning-gain effect sizes for a meta-analysis.
Research on the effectiveness of AWE tools usually compares a number of different variables and thus produces several effect sizes within the same study. For example, Huang and Renandya (2020) compared the writing outcomes of the experimental group with those of the control group on the overall writing score and five writing subcategories: content, organization, vocabulary, language use, and mechanics. In another example, Liu et al. (2017) evaluated student writing performance in terms of seven variables (i.e. spelling, grammar, coherence, conclusion, supporting ideas, sentence diversity, and organization). This common occurrence in AWE research requires a more rigorous analysis method to deal with effect-size dependence when conducting a meta-analysis. In the present study, a three-level meta-analysis is adopted because it can take such dependency into account when generating the overall effect size. We also compared the three-level model with the conventional two-level model using our current data to test whether the three-level model could better explain the data than the two-level model.
The datasets of pre-calculated effect sizes were input into the R software for analysis. The packages used in the present study were metafor (Viechtbauer, 2010), meta (Balduzzi et al., 2019), tidyverse (Wickham et al., 2019), and dmetar (Harrer et al., 2021). More information on the formulas, the code, and the guide for conducting a three-level meta-analysis in R can be found in Harrer et al. (2021). Appendix 3 presents information regarding the two-level and three-level meta-analysis models, along with the code, packages, and steps for conducting the present meta-analysis in R.
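As a rough illustration of the modelling step described above (the authors' actual code is in Appendix 3), the sketch below is one way such a three-level model could be fitted with metafor and compared with a two-level reduction; the data frame dat and its columns g, se_g, study_id, and es_id are hypothetical names.

```r
library(metafor)

# dat: one row per effect size; column names here are hypothetical
dat$v <- dat$se_g^2                      # sampling variance of each Hedges' g

# Three-level model: effect sizes (level 2) nested within studies (level 3)
m3 <- rma.mv(yi = g, V = v,
             random = ~ 1 | study_id / es_id,
             data = dat, method = "REML")

# Two-level comparison: between-study (level 3) variance fixed to zero,
# i.e. all effect sizes treated as if they were independent
m2 <- rma.mv(yi = g, V = v,
             random = ~ 1 | study_id / es_id,
             sigma2 = c(0, NA),
             data = dat, method = "REML")

summary(m3)    # pooled g, 95% CI, Q-test, and the two variance components
anova(m3, m2)  # likelihood-ratio test of the three-level vs. two-level fit
```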
Moderators and coding procedure
In line with many other meta-analyses, we adapted three commonly used categories of moderators: publication data (e.g. publication year, publication type), population data (e.g. education level, learning context, proficiency), and treatment data (e.g. outcome measure, AWE tool, duration, feedback target, learning activity) (Lee et al., 2019). A description of the examined moderators is presented in Appendix 1. The primary studies underwent multiple coding cycles by two independent raters to ensure the inter-rater reliability of the codes. The details of the coding scheme and some pre-determined cases were discussed and agreed upon by the two raters before each rater independently coded the primary studies. The overall average kappa index across the 10 moderators was 99.16 in the between-group comparison dataset and 99.42 in the within-group comparison dataset. The few disagreements between the two raters were then discussed to determine the final codes used for the meta-analysis.
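For readers who wish to reproduce the agreement check, one hedged possibility is the irr package; the data frame of rater codes below is hypothetical and only illustrates the mechanics for a single moderator.

```r
library(irr)

# Hypothetical codes from the two raters for one moderator (e.g. the AWE tool)
ratings <- data.frame(
  rater1 = c("Criterion", "Grammarly", "Pigai", "Grammarly", "Criterion"),
  rater2 = c("Criterion", "Grammarly", "Pigai", "Criterion", "Criterion")
)

kappa2(ratings)  # Cohen's kappa for the two raters
agree(ratings)   # simple percentage agreement
```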
Overall effect sizes
Table 1 below shows the overall effectiveness of AWE on student writing performance compared to the traditional teaching method. The pooled effect size based on the three-level meta-analytic model was at the medium level (g = 0.59). The 95% CI (0.15; 1.04) did not cross zero, indicating a reliable effect of AWE. The result of the Q-test was significant (p < .001), implying substantial variability in the outcomes of the primary studies and the need for moderator analyses. The estimated variance values were τ²(level 3) = 0.72 and τ²(level 2) = 0.93; I²(level 3) = 41.94% of the total variance can be attributed to between-study heterogeneity and I²(level 2) = 54.18% to within-study heterogeneity. The comparison of the three-level model with the conventional two-level model showed a significantly better fit for the three-level model (likelihood-ratio test: χ²(1) = 20.18, p < 0.001). Therefore, the three-level model better explains our between-group comparison data.
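The level-specific I² values reported here can be derived from the fitted model's variance components together with an estimate of the typical sampling variance. Below is a hedged sketch, assuming the m3 object from the earlier modelling example and following the decomposition described in Harrer et al. (2021).

```r
# Distribute the total variance over sampling error, within-study (level 2),
# and between-study (level 3) heterogeneity for a fitted rma.mv model.
mlm_i2 <- function(model) {
  W <- diag(1 / model$vi)                          # inverse sampling variances
  X <- model.matrix(model)
  P <- W - W %*% X %*% solve(t(X) %*% W %*% X) %*% t(X) %*% W
  v_typical <- (model$k - model$p) / sum(diag(P))  # typical sampling variance
  total <- sum(model$sigma2) + v_typical
  c(I2_level3 = model$sigma2[1] / total,           # between-study share
    I2_level2 = model$sigma2[2] / total)           # within-study share
}

mlm_i2(m3)  # yields proportions analogous to the I2 values reported in Table 1
```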
Table 1. Overall average effect size and heterogeneity test results in between-group comparison.

n | g (weighted ES) | SE | 95% CI lower | 95% CI upper | Q | df | p | τ²(level 3) | I²(level 3) | τ²(level 2) | I²(level 2)
85 | 0.59 | 0.22 | 0.15 | 1.04 | 1288.77 | 84 | <.001 | 0.72 | 41.94% | 0.93 | 54.18%

Notes: ES = effect size; CI = confidence interval; n = the number of effect sizes; g = Hedges’ g standardized mean difference; SE = standard error; level 3 = between-study heterogeneity; level 2 = within-study heterogeneity.

Table 2 below presents the overall average effect size of students’ writing performance after using the AWE tools compared to their own performance before the treatment. The pooled effect size was large [g = 0.98; 95% CI = (0.63; 1.33)]. As expected, the within-group overall effect size was larger than the between-group overall effect size. Similarly, the result of the Q-test was significant (p < .001), indicating substantial variability in the outcomes of the primary studies and the need for moderator analyses. The estimated variance values were τ²(level 3) = 0.98 and τ²(level 2) = 0.09; I²(level 3) = 88.14% of the total variance can be attributed to between-study heterogeneity and I²(level 2) = 8.13% to within-study heterogeneity. A comparison of the three-level model with the two-level model was also conducted; the three-level model again showed a significantly better fit (likelihood-ratio test: χ²(1) = 60.26, p < 0.001). Thus, the three-level meta-analysis also better explains our within-group comparison data.

Table 2. Overall average effect size and heterogeneity test results in within-group comparison.

n | g (weighted ES) | SE | 95% CI lower | 95% CI upper | Q | df | p | τ²(level 3) | I²(level 3) | τ²(level 2) | I²(level 2)
178 | 0.98 | 0.18 | 0.63 | 1.33 | 965.90 | 177 | <.001 | 0.98 | 88.14% | 0.09 | 8.13%

Notes: ES = effect size; CI = confidence interval; n = the number of effect sizes; g = Hedges’ g standardized mean difference; SE = standard error; level 3 = between-study heterogeneity; level 2 = within-study heterogeneity.
In order to investigate variation within the overall effect sizes, 10 groups of moderators classified into three categories (treatment data, population data, publication data) were examined. A series of meta-regressions, one for each moderator, was conducted to explore the effect size of each variable within a subgroup (moderator). The results are presented in Tables 3–5.
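Continuing the earlier sketch, a single moderator analysis under the same three-level model could look as follows; the column names tool and pub_year are hypothetical and stand in for any of the coded moderators.

```r
# Subgroup analysis for one moderator: dropping the intercept (~ tool - 1)
# yields a pooled Hedges' g with 95% CI for each level of the moderator,
# as reported in Tables 3-5.
mod_tool <- rma.mv(yi = g, V = v,
                   random = ~ 1 | study_id / es_id,
                   mods = ~ tool - 1,
                   data = dat, method = "REML")
summary(mod_tool)

# A continuous moderator (e.g. publication year) is entered the same way
mod_year <- update(mod_tool, mods = ~ pub_year)
summary(mod_year)
```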
Table 3 below reports the results from the moderator analyses of the treatment data, which include five subgroups: outcome measure, tool, duration, feedback target, and activity. First, the results for outcome measure showed that AWE had a medium between-group effect size on students’ overall writing (g = 0.73).
Table 3. Moderator analyses in the treatment data.
Between-group comparison Within-group comparison
Treatment data | n | k | g [95% CI] | n | k | g [95% CI] |
1. Outcome Measure | ||||||
(1) Overall Writing | 22 | 16 | 0.73 [−0.12; 1.59] | 27 | 17 | 1.24* [0.89; 1.59] |
(2) Content & Organization | 14 | 6 | 0.74 [−0.01; 1.50] | 17 | 7 | 0.88*** [0.48; 1.28] |
(3) Grammar & Mechanics | 24 | 14 | 0.27 [−0.51; 1.06] | 96 | 18 | 0.86 [0.56; 1.16] |
(4) Vocabulary | 9 | 5 | 0.83 [−0.11; 1.77] | 12 | 4 | 0.99 [0.64; 1.34] |
(5) Style | 2 | 1 | 0.29 [−1.45; 2.03] | 5 | 2 | 0.48 [−0.01; 0.97] |
(6) Text Complexity | 14 | 4 | 0.58 [−0.35; 1.52] | 20 | 8 | 0.59 [0.29; 0.88] |
(7) Vocabulary + Style | – | – | – | 1 | 1 | 0.96 [0.05; 1.86] |
2. Tool | ||||||
(1) Criterion | 18 | 4 | 0.34 [−0.81; 1.49] | 112 | 10 | 0.93 [−1.70; 3.56] |
(2) Grammarly | 9 | 4 | 1.04* [−0.21; 2.29] | 5 | 5 | 1.86 [−0.91; 4.63] |
(3) Pigai | 25 | 4 | 0.09 [−1.03; 1.20] | 21 | 7 | 0.89 [−1.79; 3.58] |
3. Duration | ||||||
(1) Long (≥ 10 weeks) | 25 | 9 | 0.71* [0.01; 1.40] | 20 | 10 | 1.15*** [0.61; 1.69] |
(2) Medium (3–9 weeks) | 30 | 9 | 1.13 [0.17; 2.08] | 12 | 6 | 1.07 [0.08; 2.06] |
(3) Short (≤ 2 weeks) | 30 | 7 | −0.20 [−1.18; 0.79] | 146 | 19 | 0.87 [0.31; 1.42]
4. Feedback Target ||||||
(1) Global feedback (on content & organization) | 9 | 3 | 1.47** [0.37; 2.57] | 7 | 3 | 1.24* [0.01; 2.47] |
(2) Local feedback (on grammar & vocabulary) | 21 | 10 | 0.78 [−0.51; 2.06] | 169 | 13 | 0.98 [−0.39; 2.34] |
(3) Mixed feedback | 55 | 11 | 0.19* [−1.03; 1.41] | 84 | 18 | 0.95 [−0.38; 2.28]
5. Activity ||||||
(1) Alone | 45 | 15 | 0.52 [−0.04; 1.08] | 94 | 24 | 1.10*** [0.65; 1.54] |
(2) With peer | 13 | 3 | 1.04 [−0.30; 2.37] | 55 | 2 | 0.89 [−0.70; 2.48] |
(3) With teacher | 26 | 6 | 0.53 [−0.32; 1.38] | 25 | 6 | 0.60 [−0.38; 1.59] |
Notes: CI = confidence interval; n = the number of effect sizes; k = the number of studies; g = Hedges’ g standardized mean differences; *p < .05; **p < .01; ***p < .001.
Table 4. Moderator analyses in population data.
Between-group comparison Within-group comparison
Population data | n | k | g [95% CI] | n | k | g [95% CI] |
6. Education Level | ||||||
(1) Secondary | 3 | 3 | 0.25 [−2.35; 2.86] | 4 | 4 | 0.51 [−1.32; 2.35] |
(2) Undergraduate | 77 | 19 | 0.62 [−1.53; 2.77] | 169 | 28 | 1.05 [−0.49; 2.59] |
(3) Post-graduate | 1 | 1 | 0.40 [−3.00; 3.80] | 5 | 2 | 0.96 [−0.53; 2.44] |
(4) Institute | 4 | 1 | 1.15 [−0.94; 3.23] | – | – | – |
7. Context | ||||||
(1) EFL | 71 | 22 | 0.62** [0.16; 1.09] | 112 | 29 | 0.98*** [0.58; 1.37] |
(2) ESL | 14 | 3 | 0.38 [−0.74; 1.50] | 66 | 5 | 1.02 [0.02; 2.02]
8. Proficiency ||||||
(1) Basic | 6 | 1 | 1.00 [−2.44; 4.44] | 4 | 2 | 2.85* [1.42; 4.29] |
(2) Intermediate | 18 | 7 | 1.03 [−1.92; 3.98] | 62 | 13 | 1.30 [1.12; 1.49] |
(3) Advanced | 1 | 1 | 2.00 [−0.81; 4.81] | 32 | 3 | 1.25*** [0.62; 1.88] |
(4) Mixed | 30 | 5 | 0.23 [−2.73; 3.20] | 16 | 5 | 0.55 [−0.65; 1.75] |
(5) Basic + Intermediate | – | – | – | 21 | 3 | 0.60 [−0.84; 2.04] |
Notes: CI = confidence interval; n = the number of effect sizes; k = the number of studies; g = Hedges’ g standardized mean differences; *p < .05; **p < .01; ***p < .001.
In terms of the examined subcategories of writing, a large between-group effect was found for vocabulary (g = 0.83) and medium effects for content and organization (g = 0.74) and text complexity (g = 0.58). However, a small between-group effect size was observed for grammar and mechanics (g = 0.27). Data from the within-group comparison showed large effect sizes in overall writing (g = 1.24) and in three writing subcategories: content and organization (g = 0.88), grammar and mechanics (g = 0.86), and vocabulary (g = 0.99). The within-group effect size for text complexity was still at a medium level (g = 0.59). There was a shortage of studies on style in both the between-group comparison (k = 1) and the within-group comparison (k = 2).
Second, regarding the moderator analysis of the tools, 18 different AWE tools were investigated in the literature. We therefore report the results of the three most commonly used tools (i.e. Criterion, Grammarly, Pigai), each of which was included in more than one study. The list of all the AWE tools with their key functions and effect sizes is presented in Appendix 4. The moderator analysis on tools showed a small between-group effect size for Criterion (g = 0.34) and a large between-group effect size for Grammarly (g = 1.04). Pigai did not show a differential effect on improving student writing performance compared to the traditional teaching method (g = 0.09). In the within-group comparison data, the effect size of Grammarly (g = 1.86) was about twice as large as those of the other tools (i.e. Criterion, Pigai).
Third, the analysis of duration indicated a large between-group effect size for medium duration (g = 1.13) and a medium effect size for long duration (g = 0.71). Short duration produced a small negative between-group effect size (g = −0.20). Data from the within-group comparison revealed large effect sizes for long duration (g = 1.15), medium duration (g = 1.07), and short duration (g = 0.87).
Table 5. Moderator analyses in publication data.
Between-group comparison Within-group comparison
Publication data | n | k | g [95% CI] | n | k | g [95% CI] |
9. Publication Type | ||||||
(1) SSCI/ESCI | 34 | 11 | 0.39 [−1.11; 1.88] | 120 | 16 | 1.01 [0.04; 1.97] |
(2) General journal | 33 | 8 | 0.91 [−0.61; 2.44] | 29 | 9 | 0.75 [−0.33; 1.84] |
(3) Conference paper | 11 | 3 | 0.69 [−0.64; 2.02] | 27 | 7 | 1.42 [0.62; 2.23] |
(4) Dissertation/Thesis | 7 | 2 | 0.17 [−1.87; 2.21] | 2 | 2 | 0.25 [−1.47; 1.97]
10. Publication Year | β [95% CI]: 0.04 [−0.05; 0.12] (between-group) | β [95% CI]: 0.09 [−0.01; 0.19] (within-group)
Notes: CI = confidence interval; n = the number of effect sizes; k = the number of studies; g = Hedges’ g standardized mean differences; *p < .05; **p < .01; ***p < .001.
Fourth, concerning feedback target, studies utilizing the global feedback offered by the AWE tools resulted in a large between-group effect size (g = 1.47), while the effect was at a medium level (g = 0.78) for studies targeting local feedback. The mixed feedback type from the AWE tools did not outweigh the traditional teaching method in improving student writing performance, since the effect size was negligible (g = 0.19). The within-group effect sizes were large for all three types of feedback target: global feedback (g = 1.24), local feedback (g = 0.98), and mixed feedback (g = 0.95).
The last subgroup analysis from the treatment data was activity. The results showed a large between-group effect size of AWE intervention when students learned with their peers (g = 1.04). The effect sizes were medium in cases of learning with teachers (g = 0.53) or alone (g = 0.52). In the case of within-group comparison, learning with teachers only produced a medium effect size (g = 0.60) which was the lowest compared to large effect sizes produced in learning alone (g = 1.10) and learning with peers (g = 0.89).
Table 4 presents the moderator analyses of the population data, including education level, context, and proficiency. Regarding education level, the majority of past research focused on the undergraduate level, and the effect size was medium in the between-group comparison (g = 0.62) and large in the within-group comparison (g = 1.05). The few attempts made at the secondary school level showed a preliminary small between-group effect size (g = 0.25) and a medium within-group effect size (g = 0.51). There was a lack of research on other levels, with fewer than three studies each.
In terms of context, past research mostly investigated the effectiveness of AWE with EFL students. The results indicated a medium between-group effect size (g = 0.62) and a large within-group effect size (g = 0.98) for EFL students. The few investigations of ESL students showed a small between-group effect size (g = 0.38), while the effect size was still large (g = 1.02) in the within-group comparison.
Lastly, concerning the proficiency data, intermediate and mixed proficiency levels were the two most investigated populations in the literature. While students of intermediate proficiency largely benefited from AWE (g = 1.03), groups of mixed proficiency levels only received a relatively small effect (g = 0.23) in comparison with traditional teaching. Data from the within-group comparison showed a large effect size at the intermediate level (g = 1.30), and the effect was medium at the mixed level (g = 0.55). There was a shortage of studies at the basic proficiency level (k = 1 in both between- and within-group comparisons). Also, relatively few studies compared the effect of AWE with traditional teaching for students of advanced proficiency (k = 1). The three within-group studies found at the advanced level produced a large pooled effect size (g = 1.25).
Table 5 presents the results from the moderator analyses of the publication data, which include publication type and publication year. For publication type, the moderator analysis showed that studies published in high-impact journals (e.g. SSCI/ESCI) were more likely to report a small between-group effect size (g = 0.39) of AWE, while those published in more general or lower-impact journals reported a large between-group effect size (g = 0.91). Other publication sources such as conference papers and dissertations/theses also showed different tendencies, with reported effect sizes that were medium (g = 0.69) and negligible (g = 0.17), respectively. The within-group effect sizes were large for SSCI/ESCI journals (g = 1.01) and conference papers (g = 1.42), medium for general journals (g = 0.75), and small for dissertations/theses (g = 0.25). Regarding publication year, it did not change the overall relationship between AWE and student writing performance in either the between-group (β = 0.04, p > .05) or the within-group (β = 0.09, p > .05) comparison.
Our meta-analysis uncovers some variation in the effectiveness of AWE on student writing performance. With a data set of 85 between-group effect sizes from 24 studies and 178 within-group effect sizes from 34 studies, we were able to sufficiently account for this variation. In this section, we attempt to present the overall picture by addressing the two research questions. Below are the main findings of the present meta-analysis:
AWE has positive treatment effects on student writing compared to both non-treatment and non-AWE treatment conditions.
AWE had the most consistent and largest effect on vocabulary, which refers to word usage in writing. However, it had the smallest effect on grammar (language use) and mechanics (spelling and punctuation).
Grammarly’s performance indicated it to be the most efficient tool in assisting writing, while
Pigai did not show the expected effect.
Medium and long duration of AWE treatment (i.e. more than two weeks) showed higher impact on writing outcome compared to non-AWE treatment conditions, but short duration (i.e. less than or equal to two weeks) showed lower impact.
Studying with peers in the AWE condition potentially produced the largest effect.
Current AWE instruction is more beneficial to undergraduate students but less to secondary students.
AWE is more beneficial to EFL students than ESL students.
AWE showed a large effect on students at the intermediate proficiency level.
How effective is AWE?
With a medium overall between-group effect size (g = 0.59) and a large overall within-group effect size (g = 0.98), AWE shows positive effects on improving student writing performance. These outcomes support the integration of AWE into the writing classroom. However, the variation in the effect sizes indicates that there is still room for AWE to develop, especially when compared to the traditional teaching method. Some suggestions can be offered by examining the results of the moderator analyses, as discussed in the following section.
How can any observed variation be accounted for?
It is essential for any meta-analytic study to investigate “what works for whom in what circumstances and in what respects, and how” (Pawson & Tilley, 2004, p. 151). In other words, it requires researchers to identify the degree to which different moderator variables contribute to the overall effects (Boulton & Cobb, 2017). Taking these viewpoints into account, the discussion below attempts to present: (1) in what respects AWE can be beneficial; (2) which AWE tools are more efficient; (3) how to use AWE more effectively; and (4) for whom AWE usage is more effective. In addition, the variation observed in the publication data (e.g. publication type, publication year) and the design of the post-tests in the collected studies are also discussed to understand potential publication bias and design-related issues.
First, the findings show the high benefit of AWE for improving vocabulary (e.g. word choice/usage) in writing (g_between = 0.83; g_within = 0.99). This indicates that AWE can provide feedback on various vocabulary options for students to use to enrich their writing (Shang, 2019). Moreover, learning new vocabulary could be less challenging than learning grammar, according to many teachers and scholars (Coady & Huckin, 1997).
An unexpected finding is the small between-group effect size of AWE on grammar and mechanics (g = 0.27). Grammar refers to language use in writing, such as whether a sentence is used correctly, and mechanics relates to word spelling and punctuation. To explore possible reasons for the low effect, an examination of the primary studies revealed that AWE produced negligible or small effect sizes when the studies employed less efficient tools such as Pigai and Jukuu (located at http://www.pigai.org/ and a part of Pigai) (see, e.g. Hu & Zhao, 2015; Shang, 2019). The present meta-analysis further indicated that negative effect sizes occurred when studies had short (see, e.g. Gao & Ma, 2019; Liu et al., 2017) or medium intervention time (see, e.g. Choi, 2010; Choi, 2011). Large effect sizes of AWE were observed in studies that solely investigated mechanics (see, e.g. Choi, 2011) or studies with long intervention time that used tools other than Pigai or Jukuu (see, e.g. Barrot, 2021; Ghufron & Rosyida, 2018; Wang et al., 2013). These findings support the conclusion that using adequate AWE tools for a longer period is required to improve students’ grammar in writing. This also aligns with the aforementioned challenges students face in learning grammar, as stated by many scholars.
Second, regarding the efficacy of different AWE tools, Grammarly emerged as the most effective in improving students’ writing performance. In agreement with O’Neill and Russell (2019), some explanations for the effectiveness of Grammarly could be: (1) its ability to offer teachers the opportunity to use both indirect feedback (e.g. highlighting the errors) and direct feedback (e.g. giving explicit corrections); (2) its ability to provide not only extensive feedback (e.g. the program can identify breaches of 250 grammatical rules) but also focused feedback (e.g. providing cues to address high-frequency errors); and (3) its easy-to-use system. However, the limitation of Grammarly is that it does not offer global writing feedback (e.g. feedback on content and organization, feedback at the discourse level). In this case, Criterion can be an option for providing global feedback, since it produced a slightly higher effect than the traditional teaching method. Unlike Grammarly, Pigai did not show any superior effect over traditional teaching. As studied by Gao (2021), Pigai could not diagnose essays as well as teachers did: the system is only able to diagnose word-related errors, cannot adequately identify language errors in all aspects, and its suggestions about syntactic use also lack quality.
Third, the question of how to use AWE more effectively is discussed in three respects: duration, feedback target, and activity. In the duration moderator analysis, short duration (i.e. less than or equal to two weeks) does not offer a desirable outcome in comparison to traditional teaching (g = −0.20). This can be explained from the perspective of DeKeyser’s (2007) skill acquisition theory. Initially, the AWE feedback system makes students aware of grammatical or other writing-related rules (i.e. the presentation of declarative knowledge). Then, students are offered opportunities to internalize the AWE feedback through multiple cycles of revising their written texts (i.e. the practice of procedural skills). According to the theory, it requires a longer time and more practice of procedural skills for a specific language skill, such as producing new texts, to become automatic (Liao, 2016b). Therefore, it is understandable that the effect of a short duration is much less observable.
Another possible interpretation of the low effect of short-duration studies is that students lacked a sufficient amount of time to learn to use the new AWE tools effectively. Among the short-duration studies collected in the present meta-analysis, most did not report on the tool-training session, except Xie et al. (2020), which had relatively short training lasting only 30 minutes. In addition, students’ reflections on their degree of familiarity with the AWE tools were also not reported. Therefore, while it is possible that tool-learning time is a confounding variable in the effectiveness of AWE, it may also be that the tools are easy to use or that students were already familiar with them, such that no substantial training was needed. Since there was little evidence in past short-duration studies discussing the potential impact of tool-learning time, future investigation may be necessary. Researchers can incorporate qualitative investigation when studying students’ perceptions of whether any difficulties were encountered when using the AWE tools, or observe how much time students need to use the tools proficiently. Proper software training time can then be allotted to better facilitate students’ learning process, and a suitable pre-training session can be designed for future short-duration studies on the effectiveness of AWE tools. This can control for the confounding effect of students’ proficiency in using the new tools.
In contrast, a medium duration produces a large effect on writing outcomes in comparison with traditional teaching (g = 1.13), and a long duration produces a medium effect (g = 0.71). Based on the skill acquisition theory discussed above, it is not surprising that medium and long duration show a higher impact on writing outcomes. The question concerns the lower impact of long duration compared to medium duration. Returning to the discussion of the effectiveness of different AWE tools, a first possible explanation may be that the tool, rather than the duration itself, is a significant moderator that lowered the effect in long-duration studies. Data from the collected studies show that 14 out of 25 effect sizes of long-duration between-group studies come from those using Pigai and Jukuu (a part of Pigai), while these two inefficient tools were not used in any of the medium-duration studies.
In terms of feedback target, high efficacy is found when either the global feedback (e.g. feedback on content and/or organization) or the local feedback (e.g. feedback on vocabulary and/or grammar) of the AWE tools is applied. When the mixed type of feedback (i.e. feedback at both global and local levels) is used, the effect size is negligible (g = 0.19) compared to traditional teaching. Again, we believe the low effect is caused by the AWE tools. Among the 11 studies with 55 effect sizes on mixed feedback, 5 studies with 29 effect sizes came from Pigai and Jukuu. Only two studies that produced a large effect size with the mixed feedback type used My Access! and WhiteSmoke. Criterion’s effects ranged from negative to large depending on the learning context, and the negative effects only occurred in the ESL context (Choi, 2010, 2011). To support this interpretation, we performed a multimodel inference analysis of the 10 examined moderators in the between-group comparison data. The result indicated that the most significant and important predictor of the effect was the tool, with a coefficient estimate of 0.91, while the other moderators had coefficient values lower than 0.50 (see Appendix 2). Therefore, it can be inferred that the effect of AWE on writing will be significantly influenced by a change in the AWE tool used.
Regarding activity, the few studies investigating learning with peers showed an initial overall large effect in both between- and within-group comparisons (g = 1.04 and g = 0.89, respectively). This could be a promising finding for many teachers, because peer review activities with the help of AWE can reduce teachers’ workload and shorten the time needed to deliver feedback to students (Huang & Renandya, 2020). However, more future AWE studies with peer review activities should be conducted to verify our current findings. Most of the collected studies had students use AWE tools independently, and the results revealed a medium effect size in the between-group comparison (g = 0.52). In spite of this, the within-group effect size is both large and significant (g = 1.10, p < .001), which indicates a highly consistent efficacy of AWE when students study alone, regardless of the context. Teacher review activity with the help of AWE seems to be the least efficient compared to peer review and independent study, though the effect is sufficient (i.e. at the medium level) in both the between-group (g = 0.53) and within-group (g = 0.60) comparisons. Many scholars agree that writing teachers may encounter great difficulty instructing students when the class size is large (Chen et al., 2017; Huang & Renandya, 2020; Warschauer & Ware, 2006). Students are therefore encouraged to seek help from other sources (e.g. peers) or to gradually become independent learners within an AWE learning condition.
Fourth, in answering the question of for whom AWE usage is most effective, we looked at students’ education level, context, and proficiency. The data showed that the majority of AWE studies focused on the undergraduate education level (19/24 and 28/34 studies in the between- and within-group comparisons, respectively) and the EFL context (22/24 and 29/34). Therefore, the effect sizes for these two types of population data were similar to the overall effect sizes (i.e. a medium effect size in the between-group comparison and a large effect size in the within-group comparison), and the variation in these two effects should also be similar to our previous discussion of the overall results. A point to consider is the overall small between-group effect size (g = 0.25) and medium within-group effect size (g = 0.51) from the few studies on secondary school students. It is likely that secondary school students do not benefit as much as students at other levels (e.g. undergraduate students). Furthermore, AWE seems to have a larger impact on EFL than on ESL students in the between-group comparison (g = 0.62 as opposed to g = 0.38). More future research on secondary school and ESL students should be conducted to verify these preliminary findings.
Regarding English proficiency level, most of the primary studies investigated students with intermediate or mixed proficiency, populations commonly found at the undergraduate level. Only studies with students at the intermediate level showed large effect sizes in both the between- and within-group comparisons (g = 1.03 and g = 1.12, respectively). However, a small between-group effect size (g = 0.23) and a medium within-group effect size (g = 0.55) are present for mixed-proficiency classes. Notably, all of the studies targeting the mixed proficiency level also employed mixed feedback from the AWE tools, except for two studies contributing two effect sizes, Hoang (2019) and Jayavalan and Razali (2018). Similar to the discussion in the feedback target section, we hold the view that the lower effects may be caused by the AWE tools and not necessarily by the mixed type of feedback or English proficiency.
In order to identify whether any potential publication bias may be present in the meta-analysis, we conducted moderator analyses on the publication data (e.g. publication type, publication year). Although there is no significant difference among the variables in publication type (p > 0.05), a distinct between-group effect size can be observed between SSCI/ESCI journals and general journals: the former showed a small overall effect size (g = 0.39), while the effect is large in the latter (g = 0.91). In our data collection, the studies published in SSCI/ESCI journals with mostly negative reports of AWE typically employed Pigai (15/34 between-group effect sizes) and self-established AWE systems (11/34 between-group effect sizes), whereas two long-duration studies using effective tools (e.g. Grammarly, Criterion) showed large between-group effect sizes of AWE (Barrot, 2021; Hassanzadeh & Fotoohnejad, 2021). Hence, the selected tools, rather than the publication type, may have resulted in the low effect found in SSCI/ESCI journals. Moderator analyses on publication year indicated that the overall AWE effect sizes did not change with year of publication (β_between = 0.04; β_within = 0.09; p > 0.05 in both cases). Therefore, our analyses of the publication data indicate that publication bias does not have a clear impact on our current data.
Our final discussion concerns the design of the post-tests in the studies collected for the present meta-analysis. One common aspect is that all the post-tests are direct writing production tests, meaning participants need to write paragraphs or essays on given topics. This is an essential characteristic of the post-tests, since the aim of the present study is to examine the effectiveness of AWE on students’ writing outcomes. However, there are some variations in the design of the post-tests. Firstly, while approximately half of the collected studies used the same writing topics for both pre- and post-tests, others provided different topics for the pre- and post-tests but within a similar genre (e.g. the expository genre). Having the same writing topics for both pre- and post-tests might produce more comparable texts but faces a higher pre-test sensitization effect, as students’ cognitive gains would be largest with similar tests (Willson & Putnam, 1982).
Secondly, most of the primary studies used researcher-developed post-tests related to the intervention. A few studies used writing topics from standardized tests, but participants were already familiar with those topics because they were taught the same topic genre during their learning process. The topics of the standardized tests may also have been embedded in the AWE tools, such as Criterion and Pigai, used in their practice (see, e.g. Chang et al., 2021; Hoang, 2019; Huang & Renandya, 2020; Li et al., 2015; Wang, 2013). Hence, a practice effect may have played a role in the primary studies. The practice effect might not be prominent in the between-group comparison because the control groups wrote on the same topics as the experimental groups. The effect sizes found in the within-group comparison should be interpreted with caution, but comparisons among studies remain plausible because a practice effect is present in all the primary studies.
Lastly, most of the primary studies’ post-tests were conducted under classroom or laboratory conditions, while a few studies allowed participants to complete the post-test writing after class (see, e.g. Liu et al., 2017; Lu & Li, 2016; Ohta, 2008; Xie et al., 2020). Indeed, studies having participants complete the post-tests outside of the laboratory condition pose high risks to internal validity. However, whether all study variables under the laboratory condition should be controlled can be a challenging decision for social science researchers, because the attempt to strengthen internal validity would also weaken external validity and vice versa (Nunan, 1992). In other words, what occurs under laboratory conditions may not occur under typical circumstances (e.g. in a general classroom where some teachers assign writing as students’ homework). Therefore, we decided to acknowledge the contribution of both conditions and include all of those studies in the analysis. Furthermore, because studies having students complete the post-tests after class accounted for a small proportion of the current meta-analysis (4 out of 44 collected studies), their influence on the overall effect sizes should be minimal.
For a more thorough meta-analysis, it could be useful to conduct moderator analyses on design data to explore potential variation in effects caused by the study design. For example, following our discussion of the design of the post-tests, it may be worthwhile to examine potential differences in effect sizes between studies that used the same writing topics and studies that used different writing topics in the pre- and post-tests. Exploring studies with post-tests conducted in different contexts (under classroom/laboratory conditions or outside the classroom) could be useful as well. Nevertheless, the goal of our present meta-analysis is to address possible pedagogical implications of AWE for EFL/ESL writing performance. Future researchers who wish to contribute to methodological implications could pay more attention to including design data variables as moderators in a meta-analysis.
A noticeable limitation of our study is that we did not examine the effects of missing data or bias in our meta-analysis by conducting several popular tests such as Egger’s regression test, fail-safe N tests, or the trim-and-fill method. Similar to Assink and Wibbelink (2016), we found no evaluation or official guideline for the available methods of managing missing data or bias in a three-level meta-analytic study. A possible alternative is to apply the available methods by examining bias in the same manner as in a conventional two-level meta-analysis. However, the overall effect sizes from the two models (i.e. the three-level and two-level models) were different; in our present meta-analysis, the difference is also significant, as stated in the results section. Therefore, approaches used for a conventional two-level meta-analysis might not appropriately explain or handle our missing data.
Another alternative is to calculate the average effect size of each independent study and then use all the calculated average effect sizes as the dataset for handling bias or missing data. This method would by nature create more bias, as there are already potential differences between the effect sizes within a study. Harrer et al. (2021) also presented outcome reporting bias as one of the sources of bias in meta-analysis: many studies with multiple outcomes drop the negative findings from their report and keep only the positive findings, thereby producing outcome reporting bias. In sum, including all the different effect sizes within a study can better account for the missing data than using only the average effect size of an entire study.
To conclude, due to the lack of statistical methods that can directly identify publication bias (Harrer et al., 2021) and the lack of evaluation of the available methods within a three-level meta-analysis (Assink & Wibbelink, 2016), we focused on analyzing and interpreting the effect sizes found in our current data. Potential publication bias may be indirectly observed from our moderator analysis of the publication data. Handling missing data within a three-level meta-analysis remains a limitation of the current study.
Conclusion and future directions
The present study meta-analyzed the studies on the effectiveness of AWE on EFL/ESL student writing performance. Overall, AWE has a medium effect size compared to traditional teaching
and largely facilitates within-subject development. The findings support the application of AWE in the writing classroom. The present study also offers suggestions on how to best apply AWE to writing instruction. First, AWE is effective in improving students’ use of vocabulary, but it may need a longer time frame to improve students’ grammar in writing. Second, the choice of tool should be considered carefully, as it significantly affects the learning outcome. Third, the duration of AWE use should be medium to long term; a short duration (i.e. less than 3 weeks) would have little to no effect. Fourth, peer review activities can be encouraged to better facilitate students’ writing performance. Finally, current AWE applications are more beneficial to undergraduate students, students studying in an EFL context, and students with an intermediate proficiency level.
Some directions for future investigation can be drawn from this meta-analysis. First, we noticed that past studies on AWE rarely designed delayed post-tests to examine the retention effect of AWE on writing performance. Future experimental researchers are recommended to design delayed post-tests so that a more comprehensive picture of the effectiveness of AWE can be obtained. Second, tools that provide both global and local feedback, such as My Access! or WhiteSmoke, are worth investigating in future research. As previously discussed, Grammarly is an effective tool but is limited to providing local feedback only. Criterion and Pigai, which provide both global and local feedback, did not show an overall large effect. However, the two studies examining My Access! (Khoii & Doroudian, 2013) and WhiteSmoke (Toranj & Ansari, 2012) each produced large effect sizes compared to traditional teaching. This calls for future investigations to verify these findings. Third, future research on AWE can design peer review activities so that the effectiveness of studying with peers in an AWE condition can be established more conclusively. Moreover, this will provide further evidence for teachers in deciding the learning activities for students.
Another direction is the investigation of the differential effectiveness of direct versus indirect feedback, or extensive versus focused feedback, produced by the tools. In the present meta-analysis, we were not able to examine these aspects due to the lack of relevant information in the primary studies. For example, only two studies reported the use of indirect feedback (see, e.g. Choi, 2011; Liu et al., 2017), while many papers only reported the target writing aspects (e.g. vocabulary, grammar, organization, content) addressed by the tools without indicating whether the feedback provided was direct, indirect, or both. Similarly, there was little information regarding extensive and focused feedback in the primary studies. These levels of feedback are worth investigating in future studies to contribute to the effective development of AWE tools.
Some other under-investigated areas, such as the effect of AWE on writing style or on students with basic or advanced English proficiency, can also be explored in the future. Lastly, regarding research methodology, a worthy direction for future research is to evaluate the available methods of handling missing data or biases within a three-level meta-analysis. Furthermore, considering design data variables as moderators in a meta-analysis could also contribute to the understanding of the effects of different research designs.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Funding
This work was supported by the Ministry of Science and Technology in Taiwan [grant number MOST110-2923-H-003-002-MY2].
Notes on contributors
Thuy Thi-Nhu Ngo is a doctoral student in the English Department at National Taiwan Normal University, Taipei, Taiwan. Her research interests include computer-assisted language learning and meta-analysis.
Howard Hao-Jan Chen is a Distinguished Professor in the English Department at National Taiwan Normal University, Taipei, Taiwan. Professor Chen has published several papers in the CALL Journal, ReCALL, and other related language learning journals. His research interests include computer-assisted language learning, corpus research, and second language acquisition.
Kyle Kuo-Wei Lai is a doctoral student in the English Department at National Taiwan Normal University, Taipei, Taiwan. His research interests include computer-assisted language learning and digital game-based language learning.
ORCID
Howard Hao-Jan Chen http://orcid.org/0000-0002-8943-5689
References
References marked with an asterisk indicate studies included in the meta-analysis.
*Al-Mofti, K. (2020). The effect of using online automated feedback on Iraqi EFL learners’ writings at university level.
Journal of College of Education for Women, 31(3), 1–14. https://doi.org/10.36231/coeduw/vol31no3.12.
*Aluthman, E. (2016). The effect of using automated essay evaluation on ESL undergraduate students’ writing skill.
International Journal of English Linguistics, 6(5), 54–67. https://doi.org/10.5539/ijel.v6n5p54.
Assink, M., & Wibbelink, C. (2016). Fitting three-level meta-analytic models in R: A step-by-step tutorial. The Quantitative Methods for Psychology, 12(3), 154–174. https://doi.org/10.20982/tqmp.12.3.p154.
Balduzzi, S., Rücker, G., & Schwarzer, G. (2019). How to perform a meta-analysis with R: A practical tutorial. Evidence Based Mental Health, 22(4), 153–160. https://doi.org/10.1136/ebmental-2019-300117.
*Barrot, J. (2021). Using automated written corrective feedback in the writing classrooms: Effects on L2 writing accuracy.
Computer Assisted Language Learning, 1–24. https://doi.org/10.1080/09588221.2021.1936071.
Boulton, A., & Cobb, T. (2017). Corpus use in language learning: A meta-analysis. Language Learning, 67(2), 348–393. https://doi.org/10.1111/lang.12224.
*Chang, T.-S., Huang, H.-W., Li, Y., & Whitfield, B. (2021). Exploring EFL students’ writing performance and their acceptance of AI-based automated writing feedback. 2021 2nd International Conference on Education Development and Studies (ICEDS 2021), March 9–11, Hilo, HI, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3459043.3459065
Chen, C.-F. E., & Cheng, W.-Y. E. (2008). Beyond the design of automated writing evaluation: Pedagogical practices and perceived learning effectiveness in EFL writing classes. Language Learning & Technology, 12(2), 94–112. https://doi.
Chen, H.-J., Cheng, H.-W., & Yang, T.-Y. (2017). Comparing grammar feedback provided by teachers with an automated writing evaluation system. English Teaching & Learning, 41(4), 99–131. https://doi.org/10.6330/ETL.2017.41.4.04.
*Cheng, G. (2017). The impact of online automated feedback on students’ reflective journal writing in an EFL course. The Internet and Higher Education, 34, 18–27. https://doi.org/10.1016/j.iheduc.2017.04.002.
*Choi, J. (2010). The impact of automated essay scoring (AES) for improving English language learner’s essay writing (Publication No. 3437732) [Doctoral dissertation]. University of Virginia. ProQuest Dissertations Publishing. http:// pqdd.sinica.edu.tw/doc/3437732
*Choi, J. (2011). Integration of computerized feedback to improve interactive use of written feedback in English writing class. Educational Technology International, 12(2), 71–94.
Coady, J., & Huckin, T. (1997). Second language vocabulary acquisition: A rationale for pedagogy. Cambridge University Press.
Cotos, E. (2014). Automated writing evaluation. In E. Cotos (Ed.), Genre-based automated writing evaluation for L2 research writing (pp. 1–64). Palgrave Macmillan.
*Cotos, E. (2011). Potential of automated writing evaluation feedback. CALICO Journal, 28(2), 420–459. https://doi.org/
Cuijpers, P., Weitz, E., Cristea, I., & Twisk, J. (2017). Pre-post effect sizes should be avoided in meta-analyses. Epidemiology and Psychiatric Sciences, 26(4), 364–368. https://doi.org/10.1017/S2045796016000809.
DeKeyser, R. (2007). Skill acquisition theory. In B. VanPatten, & J. Williams (Eds.), Theories in second language acquisition: An introduction (pp. 97–113). Lawrence Erlbaum Associates.
Gao, J. (2021). Exploring the feedback quality of an automated writing evaluation system pigai. International Journal of Emerging Technologies in Learning (IJET), 16(11), 322–330. https://doi.org/10.3991/ijet.v16i11.19657.
*Gao, J., & Ma, S. (2019). The effect of two forms of computer-automated metalinguistic corrective feedback. Language Learning & Technology, 23(2), 65–83. https://doi.org/10125/44683.
*Ghufron, M., & Rosyida, F. (2018). The role of grammarly in assessing English as a foreign language (EFL) writing. Lingua Cultura, 12(4), 395–403. https://doi.org/10.21512/lc.v12i4.4582.
Graham, S. (2019). Changing how writing is taught. Review of Research in Education, 43(1), 277–303. https://doi.org/10.3102/0091732X18821125.
Grimes, D., & Warschauer, M. (2010). Utility in a fallible tool: A multi-site case study of automated writing evaluation. Journal of Technology, Learning, and Assessment, 8(6), 1–44. https://ejournals.bc.edu/index.php/jtla/article/view/1625
Harrer, M., Cuijpers, P., Furukawa, T., & Ebert, D. (2021). Doing meta-analysis with R: A hands-on guide (1st ed.). Chapman & Hall/CRC Press (Taylor & Francis). https://doi.org/10.1201/9781003107347.
*Hassanzadeh, M., & Fotoohnejad, S. (2021). Implementing an automated feedback program for a foreign language writing course: A learner-centric study. Journal of Computer Assisted Learning, 37(5), 1494–1507. https://doi.org/10. 1111/jcal.12587.
Hedges, L., & Olkin, I. (1985). Statistical methods for meta-analysis. Academic Press, Inc.
*Hoang, T. L. G. (2019). Examining automated corrective feedback in EFL writing classrooms: A case study of Criterion [Doctoral dissertation]. The University of Melbourne. UM Campus Repository. http://hdl.handle.net/11343/234481
Hockly, N. (2019). Automated writing evaluation. ELT Journal, 73(1), 82–88. https://doi.org/10.1093/elt/ccy044.
*Hou, Y. (2020). Implications of the AES system Pigai for self-regulated learning. Theory and Practice in Language Studies, 10(3), 261–268. https://doi.org/10.17507/tpls.1003.01.
*Hu, Y., & Zhao, D. (2015). A comparative study of teacher feedback and automated essay scoring in college English writing. International Journal of Linguistics and Communication, 3(2), 82–97. https://doi.org/10.15640/ijlc.v3n2a9.
*Huang, S., & Renandya, W. (2020). Exploring the integration of automated feedback among lower-proficiency EFL learners. Innovation in Language Learning and Teaching, 14(1), 15–26. https://doi.org/10.1080/17501229.2018.1471083.
*Jayavalan, K., & Razali, A. (2018). Effectiveness of online grammar checker to improve secondary students’ English narrative essay writing. International Research Journal of Education and Sciences (IRJES), 2(1), 1–6. ISSN 2550-2158.
*Khoii, R., & Doroudian, A. (2013). Automated scoring of EFL learners’ written performance: A torture or a blessing?
Proceedings of ICT for Language Learning, 5146–5155.
*Lai, Y.-H. (2010). Which do students prefer to evaluate their essays: Peers or computer program. British Journal of Educational Technology, 41(3), 432–454. https://doi.org/10.1111/j.1467-8535.2009.00959.x.
Lee, H., Warschauer, M., & Lee, J. (2019). The effects of corpus use on second language vocabulary learning: A multilevel meta-analysis. Applied Linguistics, 40(5), 721–753. https://doi.org/10.1093/applin/amy012.
*Li, J., Link, S., & Hegelheimer, V. (2015). Rethinking the role of automated writing evaluation (AWE) feedback in ESL writing instruction. Journal of Second Language Writing, 27, 1–18. https://doi.org/10.1016/j.jslw.2014.10.004.
*Li, Z., Feng, H.-H., & Saricaoglu, A. (2017). The short-term and long-term effects of AWE feedback on ESL students’ development of grammatical accuracy. CALICO Journal, 34(3), 355–375. https://doi.org/10.1558/cj.26382.
*Li, Z., & Yan, D. (2020). Effect of pigai.org on English majors’ writing self-efficacy and writing performance. Journal of Physics: Conference Series, 1533, 1–6. https://doi.org/10.1088/1742-6596/1533/4/042086.
*Liao, H.-C. (2016a). Using automated writing evaluation to reduce grammar errors in writing. ELT Journal, 70(3), 308–319. https://doi.org/10.1093/elt/ccv058.
*Liao, H.-C. (2016b). Enhancing the grammatical accuracy of EFL writing by using an AWE-assisted process approach. System, 62, 77–92. https://doi.org/10.1016/j.system.2016.02.007.
*Liou, H.-C. (1993). Investigation of using text-critiquing programs in a process-oriented writing class. CALICO Journal, 10(4), 17–38. https://doi.org/10.1558/cj.v10i4.17-38.
Lipsey, M., & Wilson, D. (2001). Practical meta-analysis. Sage.
*Liu, M., Li, Y., Xu, W., & Liu, L. (2017). Automated essay feedback generation and its impact on revision. IEEE Transactions on Learning Technologies, 10(4), 502–513. https://doi.org/10.1109/TLT.2016.2612659.
*Lu, Z., Li, X., & Li, Z. (2015). AWE-based corrective feedback on developing EFL learners’ writing skill. In F. Helm, L.
Bradley, M. Guarda, & S. Thouësny (Eds.), Critical CALL –proceedings of the 2015 EUROCALL conference, Padova, Italy
(pp. 375–380). Research-publishing.net. https://doi.org/10.14705/rpnet.2015.000361.
*Lu, Z., & Li, Z. (2016). Exploring EFL learners’ lexical application in AWE-based writing. In S. Papadima-Sophocleous, L. Bradley, & S. Thouësny (Eds.), Call communities and culture – short papers from EUROCALL 2016 (pp. 295–301). Research-publishing.net. https://doi.org/10.14705/rpnet.2016.eurocall2016.578.
*Ma, K. (2013). Improving EFL graduate students’ proficiency in writing through an online automated essay assessing system. English Language Teaching, 6(7), 158–167. https://doi.org/10.5539/elt.v6n7p158.
*Mohsen, M., & Alshahrani, A. (2019). The effectiveness of using a hybrid mode of automated writing evaluation system on EFL students’ writing. Teaching English with Technology, 19(1), 118–131. ISSN: 1642-1027.
*Mørch, A. I., Engeness, I., Cheng, V. C., Cheung, W. K., & Wong, K. C. (2017). EssayCritic: Writing to learn with a knowledge-based design critiquing system. Educational Technology & Society, 20(2), 213–223.
Nunan, D. (1992). Research methods in language learning. Cambridge University Press.
Nunes, A., Cordeiro, C., Limpo, T., & Castro, S. (2021). Effectiveness of automated writing evaluation systems in school settings: A systematic review of studies from 2000 to 2020. Journal of Computer Assisted Learning, 38, 599–620. https://doi.org/10.1111/jcal.12635.
*Ohta, R. (2008). The impact of an automated evaluation system on student-writing performance. KATE Journal, 22, 23–33. https://doi.org/10.20806/katejo.22.0_23.
O’Neill, R., & Russell, A. (2019). Stop! Grammar time: University students’ perceptions of the automated feedback program Grammarly. Australasian Journal of Educational Technology, 35(1), 42–56. https://doi.org/10.14742/ajet.3795.
*Park, J. (2020). Implications of AI-based grammar checker in EFL learning and testing: Korean high school students’ writing (Publication No. I804:11009-000000127977) [Master’s thesis]. Korea University. https://library.korea.ac.kr/detail/?cid=CAT000046022553&ctype=t&lang=en.
*Parra, G. L., & Calero, S. X. (2019). Automated writing evaluation tools in the improvement of the writing skill.
International Journal of Instruction, 12(2), 209–226. https://doi.org/10.29333/iji.2019.12214a.
Pawson, R., & Tilley, N. (2004). Realist evaluation. In H.-U. Otto, A. Polutta, & H. Ziegler (Eds.), Evidence-based practice: Modernising the knowledge base of social work? (pp. 151–182). Barbara Budrich.
*Rich, C. (2012). The impact of online automated writing evaluation: A case study from Dalian. Chinese Journal of Applied Linguistics, 35(1), 63–79. https://doi.org/10.1515/cjal-2012-0006.
*Salavatizadeh, M., & Tahriri, A. (2020). The effect of blended online automated feedback and teacher feedback on EFL learners’ essay writing ability and perception. Journal of Teaching Language Skills, 39(3.2), 181–225. https://doi.org/10.22099/jtls.2021.38753.2899.
*Saricaoglu, A., & Bilki, Z. (2021). Voluntary use of automated writing evaluation by content course students. ReCALL, 33, 265–277. https://doi.org/10.1017/S0958344021000021.
*Shang, H.-F. (2019). Exploring online peer feedback and automated corrective feedback on EFL writing performance.
Interactive Learning Environments, 30, 4–16. https://doi.org/10.1080/10494820.2019.1629601.
*Toranj, S., & Ansari, D. (2012). Automated versus human essay scoring: A comparative study. Theory and Practice in Language Studies, 2(4), 719–725. https://doi.org/10.4304/tpls.2.4.719-725.
Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48. https://doi.org/10.18637/jss.v036.i03.
Wang, Y. (2012). Differences in L1 and L2 academic writing. Theory and Practice in Language Studies, 2(3), 637–641. https://doi.org/10.4304/tpls.2.3.637-641.
*Wang, P.-L. (2013). Can automated writing evaluation programs help students improve their English writing? International Journal of Applied Linguistics & English Literature, 2(1), 6–12. https://doi.org/10.7575/ijalel.v.2n.1p.6.
*Wang, S., & Xian, Y. (2011). A case study on the efficacy of error correction practice by using the automated writing evaluation system WRM 2.0 on Chinese college students’ English writing. 2011 International Conference on Computational and Information Sciences (pp. 988–991). Chengdu, China, October 21–23. https://doi.org/10.1109/ICIS.2011.21.
*Wang, Y.-J., Shang, H.-F., & Briody, P. (2013). Exploring the impact of using automated writing evaluation in English as a foreign language university students’ writing. Computer Assisted Language Learning, 26(3), 234–257. https://doi.org/10.1080/09588221.2012.655300.
Warschauer, M., & Ware, P. (2006). Automated writing evaluation: Defining the classroom research agenda. Language Teaching Research, 10(2), 157–180. https://doi.org/10.1191/1362168806lr190oa.
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686.
Willson, V. L., & Putnam, R. R. (1982). A meta-analysis of pretest sensitization effects in experimental design. American Educational Research Journal, 19(2), 249–258. https://doi.org/10.3102/00028312019002249.
*Xie, Y., Huang, L., & Wang, Y. (2020). The impact of AWE and peer feedback on Chinese EFL learners’ English writing performance. In L.-K. Lee, et al., L. U, F. Wang, S. Cheung, O. Au, & K. Li (Eds.), ICTE 2020, CCIS 1302 (pp. 258–270). https://doi.org/10.1007/978-981-33-4594-2_22.
*Yang, H. (2018). Efficiency of online grammar checker in English writing performance and students’ perceptions.
Korean Journal of English Language and Linguistics, 18(3), 328–348. https://doi.org/10.15738/kjell.18.3.201809.328.
*Zhang, L., & Huang, Z. (2020). Effect of an automated writing evaluation system on students’ EFL writing performance. In H.-J. So, M. Rodrigo, J. Mason, & A. Mitrovic (Eds.), Proceedings of the 28th International Conference on Computers in Education (pp. 567–569).